(54) Process in an information processing system for compaction and replacement of phrases.

(57) An information processing system is disclosed which provides a writer with acceptable replacement 
phrases to substitute for trite phrases in a manuscript text. The replacement phrases are grammatically 
equivalent to the trite phrases and can be immediately inserted into the text without further alteration. 
Background of the Invention

1. Technical Field 

The invention disclosed broadly relates to information processing systems and more particularly relates
to improvements to word processing systems.

2. Background Art 

The test of effective writing is whether the reader is left with an accurate understanding of the writer's
intended meaning. Each word or phrase should contribute to the accurate flow of information from the writer to
the reader. However, even practiced writers occasionally commit the error of using tired, over-familiar words
and phrases which have been used by many other writers and which have come in time to mean little. An example
is using the phrase "wearing two hats" to convey the meaning of having two jobs or capacities. Another example
is substituting the phrase "in no uncertain terms" for the single word "clearly." These errant usages are variously
referred to as cliches, trite metaphors, set phrases, pseudo-jargon, popularized technicalities or vogue words.
They all have the common fault of not saying what the writer means but only approximating his thought at best,
and possibly giving the reader an unintended message that the writer is a lazy thinker.

It would be useful to provide a mechanism for automatically scanning the text of a manuscript on command,
searching for trite phrases, highlighting the offending passage and suggesting to the writer acceptable
alternatives which can be substituted into the text. A medium which suggests itself for this mechanism is the
modern word processor and its associated dictionary-based features. Existing word processors include the
dictionary-based feature of checking for spelling errors on command by scanning the text of a manuscript stored
in a storage medium, comparing each word in the text with a stored dictionary of correctly spelled words,
highlighting misspelled words in the text, and suggesting to the writer the correctly spelled form of the word.
One example of this spell-checking feature in a word processor is described in USP 4,136,395 to Kolpek, et al.,
entitled "System for Automatically Proofreading a Document," assigned to IBM Corporation. Another
dictionary-based feature found in existing word processors is the display of a list of synonyms on command.
This is done by scanning the text of a manuscript stored in a storage medium, comparing a word selected by the
writer from the text with a stored dictionary of synonyms, and suggesting to the writer acceptable synonyms for
the selected word. One example of this synonym generation feature in a word processor is described in
USP 4,384,329 to Rosenbaum, et al., entitled "Retrieval of Related Linked Linguistic Expressions Including
Synonyms and Antonyms," assigned to IBM Corporation.

However, the problem of automatically displaying suggested acceptable phrases to replace trite phrases
in a manuscript text cannot be solved with the principles used in existing dictionary-based word processing
features, because of the need to make the replacement phrase grammatically equivalent to the trite phrase
which is to be replaced. Pronouns in the replacement phrase must grammatically agree in person, gender and
number with their antecedents in the original sentence. Verbs in the replacement phrase must grammatically
agree in person and number with the subject of the original sentence. Grammatical agreement means
correspondence in form. For example, if the subject in the original sentence is third person, plural, then the
verb in the replacement phrase for that sentence must also be third person, plural.

For a specific example, the sentence "I am not about to climb that mountain." contains the trite phrase
"am not about to." A more accurate expression of the writer's meaning is stated by substituting the replacement
phrase "do not intend to" for the trite phrase. However, if the original sentence were "He is not about to climb
that mountain.", then in order to be grammatically equivalent, the sentence with the replacement phrase would
have to start "He does not intend to...." The change in the person of the pronoun from the first person form "I"
to the third person form "He" requires changing the verb in the trite phrase from "am" to "is" and requires
changing the verb in the replacement phrase from "do" to "does." To be grammatically correct, a verb in a
sentence must agree with the person of its subject.

If the example is carried one step further, the number of the subject can be changed from singular to plural.
Thus, if the third person singular pronoun in the sentence "He is not about to climb that mountain." is changed
to the third person plural "They," the verb "is" in the trite sentence is changed to "are," as in "They are not about
to climb that mountain." The sentence with the grammatically equivalent replacement phrase would then start
"They do not intend to...." Thus, to be grammatically correct, a verb in a sentence must agree with the number
as well as the person of its subject.

The problem of maintaining grammatical equivalence between the replacement phrase and the trite phrase
it replaces becomes further complicated by the requirement that the tense of the verb in the replacement phrase
agree with the tense of the verb in the trite phrase. Continuing the example, the tense of the verb can be
changed from present to past tense. Thus, if the third person, plural, present tense verb "are" in the trite
sentence "They are not about to climb that mountain." is changed to the past tense verb "were," then the third
person, plural, present tense verb "do" must be changed to the past tense "did" for the sentence with the
replacement phrase "They did not intend to...." Thus, to be grammatically correct, a verb in the replacement
phrase must agree with the tense, as well as the number and person, of the verb in the trite phrase being
replaced. This characterizes some of the problems facing the prior art.

Objects of the Invention 

It is therefore an object of the invention to provide a process for substituting grammatically equivalent
replacement phrases for source phrases in a text.

It is a further object of the invention to store source phrases and corresponding replacement phrases in 
a computer memory, in a more compact manner than has been available in the prior art. 

Summary of the Invention

The embodiments of the invention are defined in claims 1 to 15.

Objects, features and advantages are accomplished by the information processing system invention disclosed
herein. The invention provides a writer with acceptable replacement phrases to substitute for trite phrases
in a manuscript text. The replacement phrases are grammatically equivalent to the trite phrases to be replaced
and can be immediately inserted into the text without further alteration.

Each trite phrase for which a replacement is desired is paired with its corresponding replacement phrase.
A family of trite phrases and its corresponding family of replacement phrases are represented by a phrase-pair
expression which symbolizes the phrases in all of their parts of speech (number, gender, tense, etc.). Each
phrase-pair expression includes a source phrase segment representing the family of trite phrases, which contains
a variable source word element and a constant source word element. Each phrase-pair expression also
includes a replacement phrase segment containing a variable replacement word element and a constant
replacement word element. A plurality of these phrase-pair expressions are stored in the memory of a word
processing computer, each expression representing a different family of paired trite and replacement phrases.

The variable source word element in a phrase-pair expression symbolically represents all of the parts of
speech for a verb (for example) in the trite phrase. The variable source word element serves as an address
pointer to a first table called the source table stored in the memory, containing all of the forms of the symbolically
represented verb. These verb forms are called values of the variable source word element. The plurality of
source verb forms are arranged into a plurality of ranks in the source table, having a grammatically significant
sequence. A plurality of source tables is stored in the memory, each table corresponding to a different family
of verb forms, pronoun forms, and other parts of speech.

The variable replacement word element in a phrase-pair expression symbolically represents all of the parts
of speech for a corresponding replacement verb (continuing the example) in the replacement phrase. The
variable replacement word element serves as an address pointer to a second table called the replacement table
stored in the memory, containing all of the forms of the symbolically represented replacement verb. These verb
forms are called values of the variable replacement word element. The plurality of replacement verb forms are
arranged into a plurality of ranks having a grammatically significant sequence, with the verb form in each rank
of the replacement table being grammatically equivalent to the verb form in a corresponding rank of the source
table. A plurality of replacement tables is stored in the memory, each table corresponding to a different family
of verb forms, pronoun forms, and other parts of speech.

The writer at the keyboard of a word processing computer, drafting his manuscript text, enters strings of
alpha-numeric characters which comprise an input word stream. That input word stream can be stored in the
memory of the computer or on the disk storage of the computer for further editing operations. Whether the
manuscript text is read from the memory, from the disk storage or directly from the keyboard, the resultant
strings of alpha-numeric characters can be considered the input word stream which is operated on by the
invention.

In response to a command entered by the writer, the execution unit of the computer will compare first target 
words from the input word stream with the constant source word elements in each of the plurality of phrase- 
pair expressions. 

The constant source word element in a phrase-pair expression is the constant portion of the trite phrase
which does not change when the phrase is used in various parts of speech. This constant portion can be a single
word or a sequence of words in an alpha-numeric string which is compared with the alpha-numeric strings in
the input word stream. Each phrase-pair expression in the memory is accessed for the comparison and
eventually a match is found between the constant portion of a trite phrase in one of the phrase-pair expressions
and a target word or sequence of words in the input word stream.

The variable source word element in the trite phrase portion of the matched phrase-pair expression is then
used as an address pointer to the source table containing all of the forms of the symbolically represented verb
in the trite phrase. This source table is accessed and each of the verb forms in the table is compared with a
second target word in the input word stream which is proximate to the first target words found to match the
constant portion of the trite phrase. If a match is found with one of the verb forms in the source table, then an
actual trite phrase has been located in the input word stream. The matched words will be highlighted on the
display screen of the computer as a trite phrase which is a candidate for replacement.

The invention then proceeds to generate the grammatically equivalent replacement phrase by identifying
the grammatically significant rank of the trite verb form in the source table which is equal to the matched,
second target word.

Then the replacement table is accessed. This is accomplished by using the variable replacement word
element in the replacement phrase portion of the matched phrase-pair expression as an address pointer to the
replacement table containing all of the forms of the symbolically represented verb in the replacement phrase.
This replacement table is accessed and the replacement verb form in the rank of the table which corresponds
to the grammatically significant rank previously identified in the source table is selected as the replacement
verb for the replacement phrase.

An output replacement phrase is then constructed from the replacement verb selected from the replacement
table and the constant replacement word element in the matched phrase-pair expression. The constant
replacement word element in the phrase-pair expression is the constant portion of the replacement phrase
which does not change when the phrase is used in various parts of speech. This constant portion can be a
single word or a sequence of words in an alpha-numeric string, to which is added the replacement verb to form
the output replacement phrase.

The output replacement phrase is then displayed on the display screen to the writer. The writer can then
decide whether he wants to substitute the replacement phrase for the highlighted trite phrase. If he desires to
make the substitution, the writer enters a command at the keyboard and the alpha-numeric string comprising
the replacement phrase is substituted for the first and second target words in the input word stream.

In this manner, replacement phrases which are grammatically equivalent to the trite phrases to be replaced
can be immediately inserted into the text without further alteration. The invention also provides a significant
compaction of the trite phrases and the replacement phrases for storage in the memory.

Brief Description of the Figures 

These and other objects, features and advantages of the invention will be more fully appreciated with
reference to the accompanying figures.

Fig. 1 is a system block diagram of the first embodiment of the invention. 

Fig. 2 shows the system block diagram of Fig. 1, during the operation of the invention on an input word
stream.

Fig. 3 is a flow diagram of the sequence of operational steps carried out by the first embodiment of the
invention.

Fig. 4 is a flow diagram of a second embodiment of the invention, illustrating the phrase compaction
process.

Fig. 5 is a flow chart of the second embodiment of the invention, illustrating the process for decoding the
phrase tables.

Fig. 6 is a conceptual diagram of the second embodiment of the invention, showing the overall process.

Fig. 7 is a functional block diagram of the second embodiment of the invention, illustrating the host data
processing system.

Fig. 8 is a system block diagram of the data processing system shown in Fig. 7.

Fig. 9 is a logical block diagram showing the apparatus of the memory 150 in Fig. 8, including several
designated data areas and functional programs controlling the operation of the system.

Description of the Preferred Embodiments 

The first embodiment of the invention, shown in Figs. 1, 2 and 3, is a simplified version of the second
embodiment of the invention shown in Figs. 4-9. The second embodiment of the invention makes use of the
principles of operation described for the first embodiment of the invention, and adds to those principles some
additional features.

The information processing system shown in Fig. 1 provides a writer with acceptable replacement phrases
to substitute for trite phrases in a manuscript text. In accordance with the invention, the replacement phrases
which are produced are grammatically equivalent to the trite phrases to be replaced and can be immediately
inserted into the text without further alteration. The system shown in Fig. 1 is a word processing computer
wherein an input word stream 12 of natural language text is input at the keyboard 10 by the writer in the process
of drafting his manuscript text. The writer enters strings of alpha-numeric characters at the keyboard 10, which
comprise the input word stream 12. The input word stream 12 can be stored in the memory 14 or in the disk
storage 17 of the computer for further editing operations. Whether the manuscript text is read from the memory
14, the disk storage 17 or whether it is directly input at the keyboard 10, the resultant strings of alpha-numeric
characters can be considered to be the input word stream 12 which is operated upon by the invention. The
system of Fig. 1 further includes an execution unit 16 for executing instructions to process natural language
text and to execute the instructions necessary to carry out the process of the invention. The system of Fig. 1
also includes an output display unit 18 for displaying an output word stream 20 of natural language text which
results from the operation of the invention.

In accordance with the invention, each trite phrase for which a replacement is desired is paired with its
corresponding replacement phrase. In the example shown in Fig. 2, the input word stream 12 includes the
sentence "We are not about to climb that mountain." which contains the trite phrase "are not about to." That
sentence can be replaced by the sentence "We do not intend to climb that mountain." which contains the
replacement phrase "do not intend to." In accordance with the invention, a family of trite phrases and its
corresponding family of replacement phrases are represented by a phrase-pair expression 28 which symbolizes
the phrases in all of their parts of speech (number, gender, tense, etc.). Each phrase-pair expression 28 includes
a source phrase segment representing the family of trite phrases, which contains a variable source word element 30
and a constant source word element 32, as is shown in Fig. 1. Each phrase-pair expression also includes a
replacement phrase segment containing a variable replacement word element 34 and a constant replacement
word element 36, as shown in Fig. 1. A plurality of these phrase-pair expressions 28, 28', 28", etc. are stored
at addressable locations in the memory 14, as is represented by step 60 of the flow diagram of Fig. 3. Each
phrase-pair expression 28 represents a different family of paired, trite and replacement phrases.
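
As an illustration only, a phrase-pair expression 28 can be pictured as a record with four fields. The sketch
below is a minimal Python rendering under stated assumptions: the class name `PhrasePair`, its field names and
the table names "be" and "do" are hypothetical and do not appear in the specification, which does not prescribe
any particular storage layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhrasePair:
    """One phrase-pair expression 28 (hypothetical layout, for illustration)."""
    source_variable: str       # name of the source table            (element 30)
    source_constant: str       # unchanging part of the trite phrase (element 32)
    replacement_variable: str  # name of the replacement table       (element 34)
    replacement_constant: str  # unchanging part of the replacement  (element 36)


# The whole family "am/are/is/was/were not about to" and its replacements
# "do/does/did not intend to" is held as a single record.
NOT_ABOUT_TO = PhrasePair("be", "not about to", "do", "not intend to")

# Step 60 of Fig. 3: a plurality of expressions stored in memory.
PHRASE_PAIRS = [NOT_ABOUT_TO]
```

Holding the verb by table name rather than by value is what lets one record stand for every person, number and
tense of the phrase.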

The variable source word element 30 of Fig. 1 symbolically represents all of the parts of speech for a verb
(for example) in the trite phrase. (The variable source word element 30 can also symbolically represent all of
the parts of speech for pronouns, pronoun-verb combinations, verb phrases, regular verb endings, and other
grammatical elements and combinations.) The variable source word element 30 serves as an address pointer
to a first table called the source table 38 in Fig. 1, stored at an addressable location in the memory 14 and
containing all of the forms of the symbolically represented verb. These verb forms are called values of the
variable source word element 30. The plurality of source verb forms are arranged into a plurality of ranks 40, 42,
44, 46 and 47 in the source table 38 of Fig. 1. These ranks have a grammatically significant sequence, as can
be seen in Table I and in Fig. 2. Table I analyzes the source phrase "are not about to" and its preceding pronoun
"we" into its various parts of speech. The verb "are" in the source phrase is the first person, plural, present
tense form of the verb "be." The forms of the verb "be" in Table I are "am," "are," "is," "was," and "were." The
grammatical characteristics of each of these verb forms are displayed in Table I. The verbs shown in Table I are
the various source verb forms of the variable source word element 30. The verbs are arranged into the first,
second, third, fourth and fifth ranks shown in Table I which correspond to the source table 38 ranks 40, 42,
44, 46 and 47, respectively, as shown in Fig. 2. A plurality of source tables 38, 38' and 38" is stored at
addressable locations in the memory 14, each table corresponding to a different family of verb forms, pronoun
forms, and other parts of speech, as is represented by step 62 of the flow diagram of Fig. 3.

EP 0 685 801 A1 



TABLE I

Source phrase = "We are not about to"
Source constant = "not about to"
Source variable = the various forms of the verb "be"

Pronoun : Verb : Verb Form of "be"                       : Rank
I       : am   : first person, singular, present tense   : 1st
We      : are  : first person, plural, present tense     : 2nd
You     : are  : second person, singular, present tense  : 2nd
They    : are  : third person, plural, present tense     : 2nd
He      : is   : third person, singular, present tense   : 3rd
She     : is   : third person, singular, present tense   : 3rd
It      : is   : third person, singular, present tense   : 3rd
I       : was  : first person, singular, past tense      : 4th
He      : was  : third person, singular, past tense      : 4th
She     : was  : third person, singular, past tense      : 4th
It      : was  : third person, singular, past tense      : 4th
You     : were : second person, singular, past tense     : 5th
We      : were : first person, plural, past tense        : 5th
They    : were : third person, plural, past tense        : 5th
The variable replacement word element 34 of Fig. 1 symbolically represents all of the parts of speech for
a corresponding replacement verb (continuing with the same example) in the replacement phrase. (The
variable replacement word element 34 can also symbolically represent all of the parts of speech for pronouns,
pronoun-verb combinations, verb phrases, regular verb endings, and other grammatical elements and
combinations.) The variable replacement word element 34 serves as an address pointer to a second table called the
replacement table 48 shown in Fig. 1, which is stored at an addressable location in the memory 14 and which
contains all of the forms of the symbolically represented replacement verb. These verb forms are called values
of the variable replacement word element 34. The plurality of replacement verb forms are arranged into a
plurality of ranks 40', 42', 44', 46' and 47' shown in Fig. 1. These ranks have a grammatically significant sequence
with the verb form in each rank of the replacement table 48 being grammatically equivalent to the verb form
in a corresponding rank of the source table 38. This is illustrated in Table II and in Fig. 2. Table II shows the
replacement phrase "do not intend to" and its preceding pronoun "we" analyzed into its various grammatical
forms. The verb forms for the verb "do" include the verbs "do," "does," and "did." The various parts of speech
for these verb forms are shown in Table II. These various parts of speech are ranked in the same order as the
corresponding rankings for the parts of speech shown for the source phrase in Table I. The first, second, third,
fourth and fifth ranks shown in Table II occupy ranks 40', 42', 44', 46' and 47', respectively, of the replacement
table 48 shown in Fig. 2. A plurality of replacement tables 48, 48', 48", etc. is stored at addressable locations
in the memory 14, each table corresponding to a different family of verb forms, pronoun forms and other parts
of speech, as is represented by step 64 in the flow diagram of Fig. 3.

TABLE II

Replacement phrase = "We do not intend to"
Replacement constant = "not intend to"
Replacement variable = the various forms of the verb "do"

Pronoun : Verb : Verb Form of "do"                       : Rank
I       : do   : first person, singular, present tense   : 1st
We      : do   : first person, plural, present tense     : 2nd
You     : do   : second person, singular, present tense  : 2nd
They    : do   : third person, plural, present tense     : 2nd
He      : does : third person, singular, present tense   : 3rd
She     : does : third person, singular, present tense   : 3rd
It      : does : third person, singular, present tense   : 3rd
I       : did  : first person, singular, past tense      : 4th
He      : did  : third person, singular, past tense      : 4th
She     : did  : third person, singular, past tense      : 4th
It      : did  : third person, singular, past tense      : 4th
You     : did  : second person, singular, past tense     : 5th
We      : did  : first person, plural, past tense        : 5th
They    : did  : third person, plural, past tense        : 5th
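
The rank correspondence between Table I and Table II can be sketched as two parallel lists, one entry per
rank. This is a minimal illustration, not the stored format of the source table 38 or the replacement table 48;
the names `SOURCE_TABLE_BE`, `REPLACEMENT_TABLE_DO`, `rank_of` and `equivalent_form` are assumptions made
for the example, and ranks are counted from zero here.

```python
# Ranks 40, 42, 44, 46, 47 of the source table (Table I) and ranks
# 40', 42', 44', 46', 47' of the replacement table (Table II), in the same order.
SOURCE_TABLE_BE = ["am", "are", "is", "was", "were"]
REPLACEMENT_TABLE_DO = ["do", "do", "does", "did", "did"]


def rank_of(table, word):
    """Return the 0-based rank of `word` in `table`, or None if it is absent."""
    word = word.lower()
    for rank, form in enumerate(table):
        if form == word:
            return rank
    return None


def equivalent_form(source_table, replacement_table, word):
    """Map a source verb form to the replacement verb form in the same rank."""
    rank = rank_of(source_table, word)
    return None if rank is None else replacement_table[rank]


print(equivalent_form(SOURCE_TABLE_BE, REPLACEMENT_TABLE_DO, "are"))   # do
print(equivalent_form(SOURCE_TABLE_BE, REPLACEMENT_TABLE_DO, "is"))    # does
print(equivalent_form(SOURCE_TABLE_BE, REPLACEMENT_TABLE_DO, "were"))  # did
```

Because the two lists share one ordering, no grammatical analysis is needed at lookup time: the position of the
matched source form is the only information carried across to the replacement table.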

In response to a command entered by the writer at the keyboard 10, the execution unit 16 of the computer
will compare first target words 24 from the input word stream 12 with the constant
source word elements 32 in each of the plurality of phrase-pair expressions 28, 28' and 28", as is represented
by step 66 of the flow diagram of Fig. 3. The constant source word element 32 in the phrase-pair expression
28 is the constant portion of the trite phrase which does not change when the phrase is used in various parts
of speech. This constant portion can be a single word or a sequence of words in an alpha-numeric string which
is compared with the alpha-numeric strings in the input word stream 12. Each phrase-pair expression 28, 28',
28", etc. in the memory 14 is accessed for the comparison and eventually a match is found between the
constant portion of a trite phrase in one of the phrase-pair expressions 28 and the target word or sequence of
words 24 in the input word stream 12. This is represented in step 68 of Fig. 3. As is shown in the example of
Fig. 2, the first target string 24 is the phrase "not about to" which is matched with the source constant 32 in
the phrase-pair expression 28.
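
The comparison of steps 66 and 68 amounts to searching the input word stream for a run of words equal to the
source constant. The following is a minimal sketch assuming the stream has already been split into words on
whitespace and that matching ignores letter case; both are assumptions of the example, as is the function name
`find_constant`.

```python
def find_constant(words, constant):
    """Return the index where the word sequence `constant` begins in `words`, else -1."""
    target = constant.lower().split()
    if not target:
        return -1
    lowered = [w.lower() for w in words]
    for i in range(len(lowered) - len(target) + 1):
        if lowered[i:i + len(target)] == target:
            return i
    return -1


stream = "We are not about to climb that mountain .".split()
print(find_constant(stream, "not about to"))    # 2
print(find_constant(stream, "not intend to"))   # -1
```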

The variable source word element 30 in Fig. 1, in the trite phrase portion of the matched phrase-pair
expression 28, is then used as an address pointer to the source table 38 which contains all of the forms of the
symbolically represented verb in the trite phrase. This source table 38 is accessed as is represented by step
70 of the flow diagram of Fig. 3. Each of the verb forms in the source table 38 is then compared with a second
target word or sequence of words in a string 26 in the input word stream 12, which is proximate to the first
target words 24 found to match the constant portion of the trite phrase. This is represented by step 72 in the
flow diagram of Fig. 3. If a match is found with one of the verb forms in the source table 38, then an actual
trite phrase 22 has been located in the input word stream 12. In the example shown in Fig. 2, the second target
string 26 is the word "are" which matches with the verb form "are" in the second rank 42 of the source table
38. The matched phrase 22 will be highlighted on the display screen 18 of the computer as a trite phrase which
is a candidate for replacement.
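
Steps 70 and 72 can be sketched as a check of the word next to the matched constant against the source table.
The text says only that the second target word is "proximate" to the first; the sketch assumes it is the word
immediately preceding the constant, as in the Fig. 2 example. The function name `locate_trite_phrase` and the
zero-based rank are likewise assumptions of the example.

```python
SOURCE_TABLE_BE = ["am", "are", "is", "was", "were"]  # ranks 40, 42, 44, 46, 47


def locate_trite_phrase(words, constant_start, constant_len, source_table):
    """Return (start, end, rank) of a trite phrase, or None if the verb does not match.

    Assumes the variable word immediately precedes the constant portion.
    """
    if constant_start < 1:
        return None
    candidate = words[constant_start - 1].lower()
    if candidate not in source_table:
        return None
    rank = source_table.index(candidate)
    return (constant_start - 1, constant_start + constant_len, rank)


stream = "We are not about to climb that mountain .".split()
hit = locate_trite_phrase(stream, 2, 3, SOURCE_TABLE_BE)
print(hit)                                # (1, 5, 1)
print(" ".join(stream[hit[0]:hit[1]]))    # are not about to
```

The returned span is what a display routine would highlight, and the returned rank (1, the second rank 42) is
what the later steps carry over to the replacement table.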

The invention then proceeds in step 74 of the flow diagram of Fig. 3, to generate the grammatically equiv- 
alent replacement phrase by identifying the grammatically significant rank of the trite verb form in the source 
table 38 which is equal to the matched, second target word 26. This is the second rank 42. 

Then, the replacement table 48 is accessed, as is represented by step 76 in the flow diagram of Fig. 3.
This is accomplished by using the variable replacement word element 34 in the replacement phrase portion of
the matched phrase-pair expression 28 of Fig. 1, as an address pointer to the replacement table 48 which
contains all of the forms of the symbolically represented verb in the replacement phrase. The replacement table
48 is accessed and the replacement verb form "do" in the second rank 42' of the table 48 is the replacement
verb form which corresponds to the grammatically significant second rank 42 which was previously identified
in the source table 38. Step 78 of the flow diagram of Fig. 3 represents the selection of the replacement verb
for the replacement phrase.

Fig. 1 shows that an output replacement phrase 50 consisting of a replacement value 52 and a replacement
constant 54, is constructed from the replacement verb selected from the rank 42' of the replacement table 48
and the constant replacement word element 36 in the matched phrase-pair expression 28. The constant
replacement word element 36 in the phrase-pair expression 28 is the constant portion of the replacement phrase
which does not change when the phrase is used in its various parts of speech. This constant portion can be
a single word or a sequence of words in an alpha-numeric string, to which is added the replacement verb from
the replacement table 48, to form the output replacement phrase 50, as represented by step 80 of the flow
diagram of Fig. 3.

The output replacement phrase 50 is then displayed on the display screen 18 to the writer. The writer can
then decide whether he wants to substitute the replacement phrase 50 for the highlighted trite phrase 22. If
he desires to make the substitution, the writer enters the command at the keyboard 10 and the alpha-numeric
string comprising the replacement phrase 50 is substituted for the first and the second target words or multiple
word strings 24 and 26 in the input word stream 12.

Alternatively, the substitution of the replacement phrases into the manuscript text can be done automatically
without requiring the writer's further intervention. The output word stream 20 can then be directly stored as a 
modified manuscript text in the memory 14 or disk storage 17. 
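
Steps 66 through 80 of Fig. 3, together with the optional confirmation by the writer, can be put together in a
short sketch. It is a simplified reading of the first embodiment under several assumptions: whitespace-separated
words, case-insensitive matching, and a variable word that immediately precedes the constant. The names
`TABLES`, `PHRASE_PAIRS`, `suggest`, `substitute` and the `accept` callback are hypothetical.

```python
TABLES = {
    "be": ["am", "are", "is", "was", "were"],   # source table 38
    "do": ["do", "do", "does", "did", "did"],   # replacement table 48
}
# (source variable, source constant, replacement variable, replacement constant)
PHRASE_PAIRS = [("be", "not about to", "do", "not intend to")]


def suggest(words):
    """Yield (start, end, replacement_words) for each trite phrase found."""
    lowered = [w.lower() for w in words]
    for src_var, src_const, rep_var, rep_const in PHRASE_PAIRS:
        const = src_const.split()
        for i in range(1, len(lowered) - len(const) + 1):
            if lowered[i:i + len(const)] != const:
                continue                               # steps 66-68: no constant match
            verb = lowered[i - 1]                      # steps 70-72: proximate word
            if verb not in TABLES[src_var]:
                continue
            rank = TABLES[src_var].index(verb)         # step 74: rank in source table
            rep_verb = TABLES[rep_var][rank]           # steps 76-78: same rank
            yield i - 1, i + len(const), [rep_verb] + rep_const.split()  # step 80


def substitute(words, accept=lambda old, new: True):
    """Apply accepted suggestions right to left so earlier indices stay valid."""
    out = list(words)
    for start, end, new in sorted(suggest(words), reverse=True):
        if accept(out[start:end], new):
            out[start:end] = new
    return out


print(" ".join(substitute("They were not about to climb that mountain .".split())))
# They did not intend to climb that mountain .
```

With the default `accept`, every suggestion is applied, which corresponds to the automatic mode; passing a
callback that asks the writer corresponds to the interactive mode in which each replacement is confirmed at the
keyboard.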

In this manner, replacement phrases which are grammatically equivalent to the trite phrases to be replaced
can be immediately inserted into the text without further alteration. An additional advantage of the invention
is the significant data compaction which is achieved for the trite phrases and the replacement phrases being
stored in the memory 14. By representing the family of trite phrases and their corresponding replacement
phrases with a phrase-pair expression 28, a source table 38 and a replacement table 48, a significant
reduction in the memory space required for the storage of such phrases is achieved.
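
A rough sense of the compaction can be had by counting characters for the single family used in the example.
This is a back-of-the-envelope sketch only: it counts the characters of the stored words and phrases and ignores
pointers, delimiters and encoding, none of which are specified in this passage. The variable names are
assumptions of the example.

```python
BE = ["am", "are", "is", "was", "were"]     # source table, one form per rank
DO = ["do", "do", "does", "did", "did"]     # replacement table, rank-aligned
SRC_CONST, REP_CONST = "not about to", "not intend to"

# Enumerated storage: every source phrase and every replacement phrase spelled out.
enumerated = [f"{v} {SRC_CONST}" for v in BE] + [f"{v} {REP_CONST}" for v in DO]
enumerated_chars = sum(len(p) for p in enumerated)

# Compact storage: each constant once, plus each table once.
compact_chars = len(SRC_CONST) + len(REP_CONST) + sum(map(len, BE)) + sum(map(len, DO))

print(enumerated_chars, compact_chars)   # 163 53
```

The tables are also shared: any further phrase-pair expression that varies on "be" or "do" adds only its own
constants, so the difference widens as more families are stored.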

The second embodiment of the invention shown in Figs. 4-9 builds upon the principle of operation of the
first embodiment which has been described in conjunction with Figs. 1-3. Improvements in the second
embodiment of the invention include the use of match terms to identify the source phrase in the input word stream
to increase the speed of comparison with the input word stream. The second embodiment of the invention is
described as follows.

Description of the Second Embodiment of the Invention 

The second embodiment of the system for compaction and replacement of phrases is shown in Figs. 4-9.
The system provides for the automatic replacement of words and phrases in specific linguistic context such
as automatic translation, the replacement of improper grammatical phrases, and term substitution with
definitions. Linguistic compaction is achieved by the invention through the use of symbolic expressions and
grammatical completeness is achieved through the use of relational tables.

Description of Compaction and Decoding Process 

Compaction and decoding of phrases consists of three stages: 
1. linguistic codification 
2. phrase compaction 
3. phrase decoding 

1. Procedure for Linguistic Codification of the Phrases 

The linguistic codification of the phrases is best done manually to create files that can be understood and 
maintained easily. The linguistic codification requires recognizing the elements of the language with which 
correspondences need to be established and defining the tables used to generate conjugations and other linguistic 
variants. 

One begins with a compendium of cliches and trite phrases. Many of the phrases are associated with a 
preferred usage. For example, "have the ability to" can be more concisely stated as "can" or "be able to." The 
phrase file is the compilation of phrase-pair expressions 28, encoded to allow variants of a word by reference 
to list names. The "at" symbol (@) is used as a list name symbol at the beginning of those words in the phrase 
file which have reference lists of variants or alternate parts of speech for the word in source tables 38 or in 
replacement tables 48. A typical phrase file entry is: 

    @have the ability to = @can, @be able to 

This entry is a phrase-pair expression 28 which references the lists of words with the names "@have," 
"@can," and "@be," each consisting of four entries: 



    @have      @can      @be 

    have       can       am, are, be 
    has        can       is 
    had        could     was, were 
    having     ##        being 



The @have list is a source table 38 and the @can and @be lists are replacement tables 48. The first line 
of each list contains the infinitive/present form of the verb, the second line the third person form, the third line 
the past tense forms, and the fourth line the present participle form. A null entry is indicated by a double 
pound sign (##). The mapping of the phrase to its replacement is accomplished on the basis of the correspondences 
of these lists. 
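
The line-by-line correspondence between the lists can be illustrated with a minimal Python sketch. The dictionary layout, the names SOURCE_TABLES, REPLACEMENT_TABLES, rank_of and replacement_forms, and the use of None for the "##" null entry are assumptions made for illustration; only the list contents come from the table above.

```python
# Hypothetical in-memory form of the lists shown above; None stands for "##".
SOURCE_TABLES = {
    "@have": [["have"], ["has"], ["had"], ["having"]],
}
REPLACEMENT_TABLES = {
    "@can": [["can"], ["can"], ["could"], None],
    "@be": [["am", "are", "be"], ["is"], ["was", "were"], ["being"]],
}


def rank_of(list_name, word):
    """Return the line (rank) of `word` in a source list, or None if absent."""
    for rank, forms in enumerate(SOURCE_TABLES[list_name]):
        if forms is not None and word in forms:
            return rank
    return None


def replacement_forms(list_name, rank):
    """Return the forms on the same line of a replacement list (None if null)."""
    return REPLACEMENT_TABLES[list_name][rank]


rank = rank_of("@have", "had")          # third line: past tense
print(replacement_forms("@can", rank))  # ['could']
print(replacement_forms("@be", rank))   # ['was', 'were']
```

Under these assumptions, "had the ability to" maps to "could" or to "was/were able to", while "having the ability to" has no @can form because that line of the @can list is null.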

Phrase-pair expression 28 entries of the phrase file consist of two parts, the source <PHRASE> and the 
<REPLACEMENT> phrase, separated by an equal sign. The following Backus-Naur Form (BNF) description 
defines the format of the entries. Quoted strings represent literals; lowercase names represent terminal 
symbols. (See Naur, P., "Revised Report on the Algorithmic Language ALGOL 60," Communications of the ACM, 
6 (January 1963), 1-17.) 

The format of an ENTRY is <PHRASE> "=" <REPLACEMENT>, where 
<PHRASE> is <WORD> | <PHRASE> " " <WORD> 
or <STRLIST> | <PHRASE> " " <STRLIST>, 
<REPLACEMENT> is <WORD> | <REPLACEMENT> " " <WORD>, 
<WORD> is @list 
or string 
or string+@list 
or string(ending) and 
<STRLIST> is string or strlist. 

The symbol "=" is the "equal sign" character. The symbol "|" means "or." The symbol " " is a blank 
character. 

As an example, the source phrase "@have low | high value(s)" will match against the following 
alphanumeric strings in the input word stream: 

    have low value      has low value      had low value      having low value 
    have low values     has low values     had low values     having low values 
    have high value     has high value     had high value     having high value 
    have high values    has high values    had high values    having high values 

Similarly, the use of ending lists allows a very compact form in the phrase file to match a large number 
of input words. The list "@elr," for example, contains the endings for regular verbs and allows "work+@elr" to 
match against "work," "works," "worked," and "working." 
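
The compactness of this notation can be seen by expanding a source phrase into the strings it matches. The following Python sketch is illustrative only: the function names, the tokenization on blanks, and the exact contents of the "@elr" ending list (inferred from the four forms of "work" given above) are assumptions, not the format of the actual phrase file.

```python
from itertools import product

# Assumed list contents; "" in an ending list means "no ending".
WORD_LISTS = {"@have": ["have", "has", "had", "having"]}
ENDING_LISTS = {"@elr": ["", "s", "ed", "ing"]}


def expand_word(token):
    """Return every surface form that one phrase-file token can match."""
    if token.startswith("@"):                  # @list
        return list(WORD_LISTS[token])
    if "+@" in token:                          # string+@list
        stem, list_name = token.split("+", 1)
        return [stem + ending for ending in ENDING_LISTS[list_name]]
    if token.endswith(")") and "(" in token:   # string(ending)
        stem, ending = token[:-1].split("(", 1)
        return [stem, stem + ending]
    return [token]                             # plain string


def expand_phrase(source_phrase):
    """Expand a source phrase into all the word strings it matches."""
    slots = []
    alternate_next = False
    for token in source_phrase.split():
        if token == "|":                       # next token is an alternative
            alternate_next = True
            continue
        forms = expand_word(token)
        if alternate_next:
            slots[-1] = slots[-1] + forms      # same position, extra choices
            alternate_next = False
        else:
            slots.append(forms)
    return [" ".join(words) for words in product(*slots)]


matches = expand_phrase("@have low | high value(s)")
print(len(matches))                            # 16
print(expand_word("work+@elr"))                # ['work', 'works', 'worked', 'working']
```

A decoder would not normally enumerate the strings in this way; the expansion is shown only to make visible how one codified entry stands for sixteen literal phrases.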

2. Phrase Compaction Process 

The phrase compaction is illustrated in Fig. 4. In step 101 the linguistically codified phrase-pair 
expressions 28 are read by the program. The word in the constant source word element 32 which is to be used as an 
index key (the focus word) is identified and isolated in step 102. In step 103, the invariant attributes of the key 
are used to construct a series of bit patterns that are highly characteristic of the key. The bit patterns for all 
the keys of the file are superimposed in a hash screen table. Although the superimposed bit patterns are not 
unique for any particular key, the presence of all the bits for a particular term indicates that the term has a 
high probability of being a key in the phrase table. (Examples of suitable hashing techniques are given in Bloom, 
B. H., "Space/Time Trade-Offs in Hash Coding with Allowable Errors," Communications of the ACM, 13(7), 
1970, pages 422-426 and also in Murray, D. M., "A Scatter Storage Scheme for Dictionary Lookups," Journal 
of Library Automation, Vol. 3/3, September 1970, pp. 173-201.) 
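
A hash screen of this kind can be sketched in Python along the lines of the Bloom reference. The table size, the number of bit patterns per key, and the use of SHA-256 to derive bit positions are assumptions chosen for the illustration; the text does not specify the hashing function actually used.

```python
import hashlib

SCREEN_BITS = 1024   # assumed size of the hash screen table, in bits
NUM_PATTERNS = 3     # assumed number of bit positions derived per key


def bit_positions(key):
    """Derive NUM_PATTERNS bit positions from a key (assumed hash choice)."""
    positions = []
    for i in range(NUM_PATTERNS):
        digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
        positions.append(int.from_bytes(digest[:4], "big") % SCREEN_BITS)
    return positions


def build_screen(keys):
    """Superimpose the bit patterns of all keys in one integer bit table."""
    screen = 0
    for key in keys:
        for pos in bit_positions(key):
            screen |= 1 << pos
    return screen


def may_be_key(screen, word):
    """True if every bit for `word` is on; false positives are possible."""
    return all((screen >> pos) & 1 for pos in bit_positions(word))


screen = build_screen(["ability", "about"])
print(may_be_key(screen, "about"))   # True
```

A True result only means that the word is probably a key, which is why the decoding stage follows a successful screen test with a character-by-character comparison.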

The index key is also used in step 104 to sort the phrase-pair expressions in the phrase file, to organize 
it for efficient retrieval. In step 105 the source phrase portion of each phrase-pair expression 28 is decomposed 
into terms used for matching during the decoding stage. The match terms have to include a description of 
positional constraints (such as adjacency of words) and an indication of special matching requirements (references 
to ending lists, alternate terms, etc.). Step 105 also encompasses the encoding of the match terms based on 
the frequencies of the characters in the text. Frequency-based compaction reduces storage requirements 
significantly, although not as much as linguistic codification. 
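
The text does not name the frequency-based code used in step 105. As one possible illustration, the following Python sketch uses Huffman coding, a standard technique in which frequent characters receive shorter bit strings; the choice of Huffman coding and the function names are assumptions, not the method of the disclosed system.

```python
import heapq
from collections import Counter


def huffman_code(text):
    """Build a prefix code in which frequent characters get shorter bit strings."""
    counts = Counter(text)
    if not counts:
        return {}
    if len(counts) == 1:
        return {next(iter(counts)): "0"}
    # Heap entries: (frequency, unique tie-breaker, {char: code so far}).
    heap = [(freq, i, {ch: ""}) for i, (ch, freq) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {ch: "0" + code for ch, code in left.items()}
        merged.update({ch: "1" + code for ch, code in right.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]


def encode(text, code):
    """Concatenate the bit strings of the characters of `text`."""
    return "".join(code[ch] for ch in text)


def decode(bits, code):
    """Invert `encode` by reading one prefix-free code word at a time."""
    reverse = {bits_: ch for ch, bits_ in code.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in reverse:
            out.append(reverse[current])
            current = ""
    return "".join(out)


text = "not about to"
code = huffman_code(text)
bits = encode(text, code)
print(len(bits), "bits instead of", 8 * len(text))
```

In practice the code table would be built once from the character frequencies of the whole phrase file rather than from a single match term as in this small example.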

Finally, in step 106, the encoded phrase file and the hash screen are written to an output file which is used 
during the decoding stage. 

3. Process for Decoding Phrase Tables 

Fig. 5 illustrates the steps required for decoding the phrase-pair expressions 28 in the phrase file. The 
purpose of the process is to identify target phrases within the input word stream that match against the source 
phrase portion of phrase-pair expressions 28 and provide replacement alphanumeric strings which can be 
synonyms, foreign language translations, or grammatically equivalent replacement phrases which can include 
annotations. 

Step 107 is the initial step of scanning the input word stream and identifying words and punctuation. Each 
word pair and word of the input word stream is hashed in step 108 using the same procedure that was used 
during the creation of the compact phrase tables. In step 109 the hash codes for the word pair or word from 
the input word stream are tested against the hash screen table. If all the bits of the hash code are found to be 
"on" in the hash screen table, the word pair or word from the input word stream is presumed to be in the phrase 
file and processing continues with step 110; otherwise the process moves on to the next words of the input 
word stream, looking for a match. 

Once the hash screen has been successfully matched, step 110 accesses the compacted phrase file and 
reads the records containing the key term. A character-by-character comparison is used in step 111 to 
determine whether the key term in the phrase file actually matches the word from the input word stream. The word 
may not match because of a false "collision" in the hash screen table; in this case the process goes back to 
step 108 and tries another word from the input word stream. 
[...] This requires referencing the source table 38 containing the term lists in step 112 and matching source word 
element values by suffix substitution. The final matching procedure in step 113 consists of applying rules for 
matching adjacent words of the phrase. If any matching requirement fails, processing resumes at step 108. 

Finally, when all terms have matched properly, the replacement phrase associated with the matching 
phrase is uncompacted. Processing resumes from step 108 until all the input word stream has been processed. 

Fig. 6 illustrates the process for converting the linguistically encoded phrase file into the bit screen table 
and match terms that make it possible to match efficiently against input text. Box 115 identifies a linguistically 
encoded phrase where the symbolic variable @BE1 represents forms of the verb "be" and the variable @DO 
represents forms of the verb "do." The phrase is encoded as an equation where the left side specifies the 
matching constraints and the right side specifies the corresponding replacement. Box 116 identifies the word 
pair "not about" as the index key. The word "about" is preceded by an asterisk to indicate that it is the word 
indexed. Box 117 contains the entries defining the symbolic variables. The entries of each symbolic variable 
have a one-to-one correspondence based on grammatical constraints (person, in this case). Box 119 contains 
the match terms derived from the left side of the phrase in box 115. Box 119 contains the relative word numbers 
and values required to effect a successful match; the replacement used after a successful match is given within 
parentheses next to the index word. 

The decoding process starts by isolating the words of the input word stream, hashing them, and testing 
the hash codes against the bit screen. When the match against the bit screen is successful, the rules for the 
index word are retrieved and applied against the input sentence. Box 120 in Fig. 6 identifies a word pair which 
has been successfully matched against the bit screen in box 118. Since the word "about" is the current word, 
the words around it are given relative numbers as indicated under box 120. The match terms in box 119 are 
applied to determine if the phrase matches. These terms are ordered so that the ones that require the least 
amount of effort for matching are checked first. First, the process checks to the left of the word "about" for the 
word "not" which is in the "-1" position (i.e., one word to the left of "about"). The next check is for the word "to" 
to the right of "about," and the last check is for the symbolic variable "@BE1" two words to the left of "about." 
Checking symbolic variables involves consulting their definition (box 117) and keeping track of the relative 
position of the match. In this case the word "is" matches the second line of the definition for @BE1. 

Having matched successfully, the process generates the replacement phrase from the parenthetical 
expression in box 119. Since this expression has a symbolic variable, the process retrieves a term corresponding 
to the same relative position as the term that matched. The second word of the "@DO" variable is "does" and 
the replacement phrase is "does not intend to" which is given in box 122. 

The system for compaction and replacement of phrases finds its preferred application in a host data 
processing system such as that shown in Figs. 7, 8 and 9. Fig. 7 is a system diagram of the host data processing 
system. The host data processor 130 is connected through a terminal controller 134 to a plurality of 
workstations 136, 136A and 136B. The host data processor 130 is also connected to a bulk storage unit 133. The system 
configuration of Fig. 7 can be embodied with an IBM System/370-type host data processor 130, such as an 
IBM 3081 processor connected through an IBM 3274 terminal controller 134 to an IBM 3270 workstation 136. 
Details of such a configuration can be found, for example, in U.S. Patent 4,271,479 to Cheselka, et al., entitled 
"Display Terminal With Modularly Attachable Features," which is assigned to the IBM Corporation. A more 
detailed description of the host data processor 130 can be found in IBM System/370 Principles of Operation, 
Order No. GA22-7000, published by the IBM Corporation, 1981. The host data processor 130 can employ an 
operating system such as the Virtual Machine/Conversational Monitor System (VM/CMS) which is described 
in IBM Virtual Machine Facility/370 Introduction, IBM Systems Library, Order No. GC20-1800, published by 
the IBM Corporation, 1981. 

The system shown in Fig. 7 is described in greater detail in Fig. 8 where it is seen that the host data 
processor 130 has a primary bus 148 which interconnects the channel 146, the memory 150, the execution unit 
152 and the storage controller 154. The bulk storage 133, which can be a large capacity disk drive such as an 
IBM 3380, is connected to the storage controller 154. The channel 146 is connected to a plurality of input/output 
terminals 134A. The channel 146 is also connected to the terminal controller 134. The terminal controller 134 
includes a screen buffer 140 which is connected to the display screen 137, a processor 142 which is connected 
to the screen buffer 140 and also to the keyboard 135, and the communications adapter 144 which is connected 
to the processor 142. The communications adapter 144 provides the communications interface with the 
channel 146 of the host data processor 130. The workstation 136, which includes the display screen 137 and the 
keyboard 135, is also shown in Fig. 8, as it is related to the terminal controller 134. In addition, the channel 
146 includes an output to the printer 156. 

A user at the workstation 136 will access the system by inputting commands and working text at the 
keyboard 135. This information is processed by the processor 142 which writes into the local screen buffer 140 
[...] the keyboard 135, the processor 142 alerts the communications adapter 144 to transfer those portions of the 
working text which have been changed in the screen buffer 140, to the channel 146 of the host data processor 
130. The information received by the channel 146 is transferred to the bus 148. Conversely, when information 
is provided by the bulk storage 133 through the storage controller 154 to the bus 148, or by the execution 
unit 152 to the bus 148, or by the memory 150 to the bus 148, that information is transferred by the channel 
146 to the communications adapter 144 at the terminal controller 134 for display on the display screen 137. 

The random access memory 150 in the host data processor 130 includes a number of data areas and 
functional programs for operating with the data input into it through the bus 160 which is connected to the bus 148. 
Fig. 9 is a logical block diagram showing the apparatus of the memory 150 including several designated data 
areas and functional programs controlling the operation of the system. The instructions in each of the 
functional programs are executed by the execution unit 152. The memory 150 is divided into a plurality of 
substantially identical partitions 200, 200A and 200B which respectively perform the functions for workstations 136, 
136A and 136B of Fig. 7. The VM/CMS operating system program 70 in the memory 150 provides the overall 
control for the operation of the host data processor 130 and provides the coordination of the memory partitions 
200, 200A and 200B so that the users of the respective workstations 136, 136A and 136B appear to have 
seemingly separate and independent IBM System/370 computing systems. See the above cited VM/CMS reference 
for further details. The file access method 172 coordinates transfers of data between the bulk store buffer 174 
in the memory 150 and the storage controller 154 which interfaces with the bulk storage 133. The printer 
executive 175 controls printer 156 operations through the channel 146. 

Fig. 9 shows the apparatus of memory 150 during the decoding process when phrases within an input 
string are matched against the phrase file and replacement strings are provided which can be synonyms, 
foreign language translations, or replacement phrases which can also include annotations. During a text 
processing session, the operator at the workstation 136 inputs words and phrases at keyboard 135 and the terminal 
controller 134 transfers that text to the host data processor 130 where it is stored in the memory 150 in the 
working text buffer 178 where it can be operated upon by the word processing executive 176 to carry out 
conventional word processing operations. When the operator indicates by a control input at the keyboard 135 that 
phrase substitution is desired, input phrase strings are transferred from the working text buffer 178 to the input 
phrase register 182. The hashing processor 180 operates on the phrases in the input phrase register 182 and 
provides hash-encoded values for the input phrase to the hash-encoded input phrase register 184. The 
hash-encoded values for the input in register 184 are then compared with the hash bit screen table (for the source file) in 
the buffer 188 by means of the comparator 186, in order to identify index or focus words. When a successful 
comparison is achieved by the comparator 186, the rules processor 190 performs a matching operation, 
checking whether the adjacent words in the phrase in the input phrase register 182 satisfy the replacement 
rules. If the rules processor 190 finds that the terms of the input phrase match those of the source file 
phrase, then the symbolic variable processor 192 selects the correct part of speech for the symbolic variables 
which occur in the replacement phrase. The replacement string is then output from the file buffer 194 which 
contains the equivalent phrases, and is stored in the replacement string phrase buffer 196, for transmission 
over bus 160 and through the channel 146 to the terminal controller 134, for display on the display screen 137 
at the workstation 136. The operator can then elect whether to adopt the suggested replacement phrase being 
displayed on the display screen 137. The operator can make this election by entry at the keyboard 135, 
indicating the desired substitution of the replacement string for the existing input phrase. 

Although the disclosed embodiment in Figs. 7, 8 and 9 is in a host data processing system, the invention 
can also find application in smaller data processing systems, such as the IBM Personal Computer, Model 5160, 
for example. 

The resulting system for compaction and replacement of phrases provides an improved system for phrase 
replacement which is based upon the linguistic relationship between the input phrase and the replacement 
phrase. The information is compacted using a combination of character frequency encoding and a recognition 
of the linguistic regularities in natural languages. 






Claims 



1. Process for the compaction and replacement of phrases with grammatically equivalent phrases 
conforming to conventional grammatical constraints of the original phrase, comprising: 

building a reference list of pairs of phrases to correspond with each other, for enabling the replacement of 
a phrase with a grammatically equivalent phrase; 

scanning a text to be analyzed, to match a source phrase with a family of target phrases, employing a bit 
[...] frequently used word and a proximate next-less frequently used word; 

continuing with said scanning until a match is obtained in said bit map, indicating that a suitable target 
phrase has been found; and 

displaying said target phrase. 

2. Process for compiling a phrase table of phrases with grammatically equivalent phrases conforming to 
conventional grammatical constraints of the original phrase, comprising: 

preparing a set of linguistically codified phrases which arranges the order of storage of the elements of 
the language with which grammatical relationships need to be established and which defines stored tables 
used to generate conjugations and linguistic variants; 

isolating the rarest used word as a file key for each codified phrase; 

creating a hash screen for each said file key; 

arranging the order of storage phrases in a phrase file by said file key; 

generating characteristic match terms from phrases and compacting said set of codified phrases based 
upon character frequency; 

storing said match terms for said compacted phrases and hash screen for reference by a linguistic decoder. 

3. Process for decoding phrase tables for replacing phrases with grammatically equivalent phrases 
conforming to conventional grammatical constraints of the original phrase, comprising the steps of: 

scanning an input text containing a plurality of phrases; 

hashing with a data processor a word selected from said plurality of input phrases; 

comparing said hashed word with a hash bit screen for a phrase file of phrases equivalent to said selected 
word; 

matching adjacent words to said selected word based upon linguistic rules relating said input phrase with 
said phrase derived from said phrase file; 

outputting a replacement phrase from said phrase file which satisfies said matching step. 
FIG. 2 (EXAMPLE) 

FIG. 2A 

STORING A PLURALITY OF PHRASE-PAIR EXPRESSIONS IN THE MEMORY, 
EACH EXPRESSION INCLUDING A SOURCE PHRASE SEGMENT CONTAINING A 
VARIABLE SOURCE WORD ELEMENT AND A CONSTANT SOURCE WORD ELEMENT 
AND EACH EXPRESSION INCLUDING A REPLACEMENT PHRASE SEGMENT 
CONTAINING A VARIABLE REPLACEMENT WORD ELEMENT AND A CONSTANT 
REPLACEMENT WORD ELEMENT 



STORING A SOURCE TABLE IN THE MEMORY, HAVING A PLURALITY OF 
SOURCE WORD ELEMENT VALUES ARRANGED INTO A PLURALITY OF RANKS 
HAVING A GRAMMATICALLY SIGNIFICANT SEQUENCE, THE VARIABLE SOURCE 
WORD ELEMENT IN A FIRST ONE OF THE PLURALITY OF PHRASE-PAIR 
EXPRESSIONS SERVING AS A POINTER FOR ACCESSING THE SOURCE TABLE 



STORING A REPLACEMENT TABLE IN THE MEMORY, HAVING A PLURALITY OF 
REPLACEMENT WORD ELEMENT VALUES ARRANGED INTO A PLURALITY OF 
RANKS HAVING A GRAMMATICALLY SIGNIFICANT SEQUENCE WITH VALUES IN 
EACH RANK OF THE REPLACEMENT TABLE BEING GRAMMATICALLY EQUIVALENT 
TO THE VALUES IN A CORRESPONDING RANK OF THE SOURCE TABLE, THE 
VARIABLE REPLACEMENT WORD ELEMENT IN THE FIRST ONE OF THE 
PHRASE-PAIR EXPRESSIONS SERVING AS A POINTER FOR ACCESSING THE 
REPLACEMENT TABLE 



COMPARING IN THE EXECUTION UNIT, FIRST TARGET WORDS FROM THE 
INPUT WORD STREAM WITH THE CONSTANT SOURCE WORD ELEMENTS IN THE 
PLURALITY OF PHRASE-PAIR EXPRESSIONS 



ACCESSING THE FIRST ONE OF THE PHRASE-PAIR EXPRESSIONS HAVING A 
CONSTANT SOURCE WORD ELEMENT EQUAL TO A SELECTED ONE OF THE FIRST 
TARGET WORDS 



ACCESSING THE SOURCE TABLE WHICH IS POINTED TO BY THE VARIABLE 
SOURCE WORD ELEMENT IN THE FIRST ONE OF THE PHRASE-PAIR 
EXPRESSIONS 



COMPARING EACH OF THE SOURCE WORD ELEMENT VALUES IN THE SOURCE 
TABLE WITH A SECOND TARGET WORD FROM THE INPUT WORD STREAM 
PROXIMATE TO THE SELECTED ONE OF THE FIRST TARGET WORDS 



IDENTIFYING THE GRAMMATICALLY SIGNIFICANT RANK OF THE SOURCE WORD 
ELEMENT VALUE IN THE SOURCE TABLE WHICH IS EQUAL TO THE SECOND 
TARGET WORD 

ACCESSING THE REPLACEMENT TABLE WHICH IS POINTED TO BY THE 
VARIABLE REPLACEMENT WORD ELEMENT IN THE FIRST ONE OF THE 
PHRASE-PAIR EXPRESSIONS 



ACCESSING FROM THE REPLACEMENT TABLE THE GRAMMATICALLY EQUIVALENT 
REPLACEMENT WORD ELEMENT VALUE IN THE RANK OF THE REPLACEMENT 
TABLE WHICH CORRESPONDS TO THE GRAMMATICALLY SIGNIFICANT RANK 
IDENTIFIED IN THE SOURCE TABLE 



OUTPUTTING AN OUTPUT REPLACEMENT PHRASE TO THE OUTPUT UNIT, WHICH 
INCLUDES THE GRAMMATICALLY EQUIVALENT REPLACEMENT WORD ELEMENT 
VALUE AND THE CONSTANT REPLACEMENT WORD ELEMENT FROM THE FIRST 
ONE OF THE PHRASE-PAIR EXPRESSIONS 